A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans, scheduling conflicts, and so on. Cancelling is often made easy by the option to do so free of charge, or at a low cost, which is convenient for hotel guests but a less desirable and possibly revenue-diminishing factor for hotels to deal with. Such losses are particularly high on last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts.
The increasing number of cancellations calls for a Machine Learning based solution that can help in predicting which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors strongly influence booking cancellations, build a model that can predict in advance which bookings are likely to be canceled, and help formulate profitable cancellation and refund policies.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
# To filter the warnings
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# To get different metric scores
from sklearn import metrics
from sklearn.metrics import accuracy_score, roc_curve, confusion_matrix, roc_auc_score, f1_score, precision_score, recall_score, precision_recall_curve, make_scorer
# To get decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To get grid search cv
from sklearn.model_selection import GridSearchCV
# Loading the dataset
from google.colab import drive
drive.mount('/content/drive')
# read the data
path="/content/drive/MyDrive/Data Science/INNHotelsGroup.csv"
innhotel_df = pd.read_csv(path)
# returns the first 5 rows
innhotel_df.head()
Mounted at /content/drive
| Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
# Shape of the data set
innhotel_df.shape
(36275, 19)
The INN Hotels dataset has 36275 rows and 19 columns.
# Data type and null count of the dataset
innhotel_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36275 entries, 0 to 36274 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Booking_ID 36275 non-null object 1 no_of_adults 36275 non-null int64 2 no_of_children 36275 non-null int64 3 no_of_weekend_nights 36275 non-null int64 4 no_of_week_nights 36275 non-null int64 5 type_of_meal_plan 36275 non-null object 6 required_car_parking_space 36275 non-null int64 7 room_type_reserved 36275 non-null object 8 lead_time 36275 non-null int64 9 arrival_year 36275 non-null int64 10 arrival_month 36275 non-null int64 11 arrival_date 36275 non-null int64 12 market_segment_type 36275 non-null object 13 repeated_guest 36275 non-null int64 14 no_of_previous_cancellations 36275 non-null int64 15 no_of_previous_bookings_not_canceled 36275 non-null int64 16 avg_price_per_room 36275 non-null float64 17 no_of_special_requests 36275 non-null int64 18 booking_status 36275 non-null object dtypes: float64(1), int64(13), object(5) memory usage: 5.3+ MB
Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, booking_status: object data type
no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, required_car_parking_space, lead_time, arrival_year, arrival_month, arrival_date, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, no_of_special_requests: int data type
avg_price_per_room: float data type
# Getting missing values in the dataset
innhotel_df.isna().sum()
Booking_ID 0 no_of_adults 0 no_of_children 0 no_of_weekend_nights 0 no_of_week_nights 0 type_of_meal_plan 0 required_car_parking_space 0 room_type_reserved 0 lead_time 0 arrival_year 0 arrival_month 0 arrival_date 0 market_segment_type 0 repeated_guest 0 no_of_previous_cancellations 0 no_of_previous_bookings_not_canceled 0 avg_price_per_room 0 no_of_special_requests 0 booking_status 0 dtype: int64
There are no null or missing values in the dataset
# Checking duplicates
innhotel_df.duplicated().sum()
0
There are no duplicates in the dataset
# Creating copy of the dataset
innhotel_df2 = innhotel_df.copy()
# Deleting booking ID column
innhotel_df.drop('Booking_ID',axis=1,inplace=True)
# Summary of the dataset
innhotel_df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| no_of_adults | 36275.0 | NaN | NaN | NaN | 1.844962 | 0.518715 | 0.0 | 2.0 | 2.0 | 2.0 | 4.0 |
| no_of_children | 36275.0 | NaN | NaN | NaN | 0.105279 | 0.402648 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 |
| no_of_weekend_nights | 36275.0 | NaN | NaN | NaN | 0.810724 | 0.870644 | 0.0 | 0.0 | 1.0 | 2.0 | 7.0 |
| no_of_week_nights | 36275.0 | NaN | NaN | NaN | 2.2043 | 1.410905 | 0.0 | 1.0 | 2.0 | 3.0 | 17.0 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| required_car_parking_space | 36275.0 | NaN | NaN | NaN | 0.030986 | 0.173281 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| lead_time | 36275.0 | NaN | NaN | NaN | 85.232557 | 85.930817 | 0.0 | 17.0 | 57.0 | 126.0 | 443.0 |
| arrival_year | 36275.0 | NaN | NaN | NaN | 2017.820427 | 0.383836 | 2017.0 | 2018.0 | 2018.0 | 2018.0 | 2018.0 |
| arrival_month | 36275.0 | NaN | NaN | NaN | 7.423653 | 3.069894 | 1.0 | 5.0 | 8.0 | 10.0 | 12.0 |
| arrival_date | 36275.0 | NaN | NaN | NaN | 15.596995 | 8.740447 | 1.0 | 8.0 | 16.0 | 23.0 | 31.0 |
| market_segment_type | 36275 | 5 | Online | 23214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| repeated_guest | 36275.0 | NaN | NaN | NaN | 0.025637 | 0.158053 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| no_of_previous_cancellations | 36275.0 | NaN | NaN | NaN | 0.023349 | 0.368331 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
| no_of_previous_bookings_not_canceled | 36275.0 | NaN | NaN | NaN | 0.153411 | 1.754171 | 0.0 | 0.0 | 0.0 | 0.0 | 58.0 |
| avg_price_per_room | 36275.0 | NaN | NaN | NaN | 103.423539 | 35.089424 | 0.0 | 80.3 | 99.45 | 120.0 | 540.0 |
| no_of_special_requests | 36275.0 | NaN | NaN | NaN | 0.619655 | 0.786236 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
# Checking different counts of categorical variables in dataset
for i in innhotel_df.columns:
    if innhotel_df[i].dtypes == object:
        print('There are', innhotel_df[i].nunique(), 'different types of', i, 'in the dataset, with counts as below')
        display(innhotel_df[i].value_counts(normalize=True))
        display(innhotel_df[i].value_counts())
        print('----------------------------------------------------------------------------')
There are 4 different types of type_of_meal_plan in the dataset, with counts as below
Meal Plan 1 0.767333 Not Selected 0.141420 Meal Plan 2 0.091110 Meal Plan 3 0.000138 Name: type_of_meal_plan, dtype: float64
Meal Plan 1 27835 Not Selected 5130 Meal Plan 2 3305 Meal Plan 3 5 Name: type_of_meal_plan, dtype: int64
---------------------------------------------------------------------------- There are 7 different types of room_type_reserved in the dataset, with counts as below
Room_Type 1 0.775465 Room_Type 4 0.166975 Room_Type 6 0.026630 Room_Type 2 0.019076 Room_Type 5 0.007305 Room_Type 7 0.004356 Room_Type 3 0.000193 Name: room_type_reserved, dtype: float64
Room_Type 1 28130 Room_Type 4 6057 Room_Type 6 966 Room_Type 2 692 Room_Type 5 265 Room_Type 7 158 Room_Type 3 7 Name: room_type_reserved, dtype: int64
---------------------------------------------------------------------------- There are 5 different types of market_segment_type in the dataset, with counts as below
Online 0.639945 Offline 0.290227 Corporate 0.055603 Complementary 0.010779 Aviation 0.003446 Name: market_segment_type, dtype: float64
Online 23214 Offline 10528 Corporate 2017 Complementary 391 Aviation 125 Name: market_segment_type, dtype: int64
---------------------------------------------------------------------------- There are 2 different types of booking_status in the dataset, with counts as below
Not_Canceled 0.672364 Canceled 0.327636 Name: booking_status, dtype: float64
Not_Canceled 24390 Canceled 11885 Name: booking_status, dtype: int64
----------------------------------------------------------------------------
# Getting percent on bar plot
def barplot_values_percent(ax):
    heightlst = []
    for i in ax.patches:
        heightlst.append(i.get_height())
    total = sum(heightlst)
    for i in ax.patches:
        x = i.get_x() + 0.05  # adjust the number (higher = to the right, lower = to the left)
        height = i.get_height() + 0.1  # adjust the number (higher = up, lower = down)
        value = "{0:.2f}".format((i.get_height() / total) * 100) + '%'
        ax.text(x, height, value, fontsize=10, color='red')
# Getting median on box plot
import matplotlib.patheffects as path_effects
def add_median_labels(ax, fmt='.1f'):
    lines = ax.get_lines()
    boxes = [c for c in ax.get_children() if type(c).__name__ == 'PathPatch']
    lines_per_box = int(len(lines) / len(boxes))
    for median in lines[4:len(lines):lines_per_box]:
        x, y = (data.mean() for data in median.get_data())
        # choose value depending on horizontal or vertical plot orientation
        value = x if (median.get_xdata()[1] - median.get_xdata()[0]) == 0 else y
        text = ax.text(x, y, f'{value:{fmt}}', ha='center', va='center',
                       fontweight='bold', color='white')
        # create median-colored border around white text for contrast
        text.set_path_effects([
            path_effects.Stroke(linewidth=3, foreground=median.get_color()),
            path_effects.Normal(),
        ])
# Histplot, boxplot and count plot functions
def hist(fea, df, kde):
    sns.histplot(x=fea, data=df, kde=kde)
    plt.show()

def box(fea, df):
    ax = sns.boxplot(x=fea, data=df)
    add_median_labels(ax)
    plt.show()

def count(fea, df):
    ax = sns.countplot(x=fea, data=df)
    barplot_values_percent(ax)
    plt.show()
# Univariate analysis for no of adults
hist('no_of_adults',innhotel_df,False)
box('no_of_adults',innhotel_df)
display(innhotel_df['no_of_adults'].value_counts(normalize=True))
2 0.719724 1 0.212130 3 0.063873 0 0.003832 4 0.000441 Name: no_of_adults, dtype: float64
As you can see, 72% of reservations have 2 adults, followed by 21.21% with 1 adult and then 3 adults. A few reservations have 0 or 4 adults.
# Univariate analysis for no of children
hist('no_of_children',innhotel_df,False)
box('no_of_children',innhotel_df)
display(innhotel_df['no_of_children'].value_counts(normalize=True))
0 0.925624 1 0.044604 2 0.029166 3 0.000524 9 0.000055 10 0.000028 Name: no_of_children, dtype: float64
As you can see, 92.56% of reservations have no children, followed by 4.46% with 1 child. The maximum number of children in a reservation is 10.
# Univariate analysis for no_of_weekend_nights
hist('no_of_weekend_nights',innhotel_df,False)
box('no_of_weekend_nights',innhotel_df)
display(innhotel_df['no_of_weekend_nights'].value_counts(normalize=True))
0 0.465114 1 0.275534 2 0.250062 3 0.004218 4 0.003556 5 0.000937 6 0.000551 7 0.000028 Name: no_of_weekend_nights, dtype: float64
As you can see, 46.51% of reservations include 0 weekend nights, followed by 27.55% with 1 weekend night and 25% with 2 weekend nights. The maximum value for no_of_weekend_nights is 7.
# Univariate analysis for no_of_week_nights
hist('no_of_week_nights',innhotel_df,False)
box('no_of_week_nights',innhotel_df)
display(innhotel_df['no_of_week_nights'].value_counts(normalize=True))
2 0.315479 1 0.261558 3 0.216099 4 0.082426 0 0.065803 5 0.044493 6 0.005210 7 0.003115 10 0.001709 8 0.001709 9 0.000937 11 0.000469 15 0.000276 12 0.000248 14 0.000193 13 0.000138 17 0.000083 16 0.000055 Name: no_of_week_nights, dtype: float64
As you can see, 31.55% of reservations include 2 week nights, followed by 26.16% with 1 week night. The maximum value for no_of_week_nights is 17.
# Univariate analysis for type_of_meal_plan
count('type_of_meal_plan',innhotel_df)
As you can see, 76.73% of reservations selected Meal Plan 1, while 14.14% did not select any meal plan. About 9.11% of reservations selected Meal Plan 2 and only 0.01% selected Meal Plan 3.
# Univariate analysis for required_car_parking_space
hist('required_car_parking_space',innhotel_df,False)
box('required_car_parking_space',innhotel_df)
display(innhotel_df['required_car_parking_space'].value_counts(normalize=True))
0 0.969014 1 0.030986 Name: required_car_parking_space, dtype: float64
As you can see, 96.9% of reservations do not require a car parking space; only 3.1% do.
# Univariate analysis for room_type_reserved
plt.figure(figsize=(12, 7))
count('room_type_reserved',innhotel_df)
As you can see, 77.55% of reservations are for Room_Type 1, followed by 16.7% for Room_Type 4. The remaining 5 room types together add up to less than 6%.
# Univariate analysis for lead_time
hist('lead_time',innhotel_df,True)
box('lead_time',innhotel_df)
As you can see, a large number of reservations (over 5,000 in the first bin) are made with little or no lead time. The average lead time is 85.23 days and the median is 57 days; some reservations have a lead time of more than 400 days. The distribution is right-skewed.
# Univariate analysis for arrival_year
hist('arrival_year',innhotel_df,False)
box('arrival_year',innhotel_df)
display(innhotel_df['arrival_year'].value_counts(normalize=True))
2018 0.820427 2017 0.179573 Name: arrival_year, dtype: float64
As you can see, about 82% of the data is from 2018 and only 17.96% is from 2017.
What are the busiest months in the hotel?
hist('arrival_month',innhotel_df,False)
# Univariate analysis for arrival_month for year 2018
hist('arrival_month',innhotel_df[innhotel_df['arrival_year'] == 2018],False)
box('arrival_month',innhotel_df)
As you can see, the busiest months in 2018 are June (6) and October (10). Very few reservations fall in the first quarter of the year.
# Univariate analysis for arrival_date
hist('arrival_date',innhotel_df,False)
box('arrival_date',innhotel_df)
As you can see, reservations are spread almost evenly across the days of the month.
Which market segment do most of the guests come from?
# Univariate analysis for market_segment_type
count('market_segment_type',innhotel_df)
As you can see, 64% of reservations come from the Online market segment, followed by 29% from Offline. The Corporate segment accounts for about 5.56%.
# Univariate analysis for repeated_guest
hist('repeated_guest',innhotel_df,False)
box('repeated_guest',innhotel_df)
display(innhotel_df['repeated_guest'].value_counts(normalize=True))
0 0.974363 1 0.025637 Name: repeated_guest, dtype: float64
As you can see, only 2.56% of reservations are made by repeat guests. Most reservations come from new guests.
# Univariate analysis for no_of_previous_cancellations
hist('no_of_previous_cancellations',innhotel_df,False)
box('no_of_previous_cancellations',innhotel_df)
display(innhotel_df['no_of_previous_cancellations'].value_counts(normalize=True))
0 0.990682 1 0.005458 2 0.001268 3 0.001185 11 0.000689 5 0.000303 4 0.000276 13 0.000110 6 0.000028 Name: no_of_previous_cancellations, dtype: float64
Almost 99.07% of reservations have no previous cancellations (or are booked by new customers). The maximum number of previous cancellations is 13.
# Univariate analysis for no_of_previous_booking_not_cancelled
hist('no_of_previous_bookings_not_canceled',innhotel_df,False)
box('no_of_previous_bookings_not_canceled',innhotel_df)
As you can see, almost 99% of bookings have a value of 0, and the maximum value is 58.
# Univariate analysis for avg_price_per_room
plt.figure(figsize=(12, 7))
hist('avg_price_per_room',innhotel_df,True)
box('avg_price_per_room',innhotel_df)
The minimum avg_price_per_room is 0 and the maximum is 540 dollars. The average price is 103.42 dollars and the median is 99.45 dollars. The distribution is roughly normal, but we need to treat the bookings with a price of 0 dollars.
# Univariate analysis for no_of_special_requests
hist('no_of_special_requests',innhotel_df,False)
box('no_of_special_requests',innhotel_df)
display(innhotel_df['no_of_special_requests'].value_counts(normalize=True))
0 0.545196 1 0.313522 2 0.120303 3 0.018608 4 0.002150 5 0.000221 Name: no_of_special_requests, dtype: float64
As you can see, about 54.5% of reservations made 0 special requests, followed by 31.35% with 1 request. A few reservations have 4 or 5 requests.
What percentage of bookings are canceled?
# Univariate analysis for booking_status
count('booking_status',innhotel_df)
As you can see, about 67.24% of reservations are not canceled and 32.76% are canceled by the customer.
# Replacing not cancelled with 0 and canceled with 1 for booking status
mappings = {'Not_Canceled':0, 'Canceled':1}
innhotel_df['booking_status'] = innhotel_df['booking_status'].replace(mappings)
# Heat plot for all numeric variable
plt.figure(figsize=(12, 7))
sns.heatmap(innhotel_df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1)  # numeric_only skips the remaining object columns
plt.show()
As you can see, booking_status has its strongest correlation with lead_time (0.44). It also has positive correlations with avg_price_per_room (0.14) and arrival_year (0.18). avg_price_per_room is positively correlated with no_of_adults (0.30) and no_of_children (0.34).
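The same relationships can be read off programmatically rather than from the heatmap by ranking correlations against the target. A minimal sketch; the small frame below is a hypothetical stand-in for the numeric columns of innhotel_df after booking_status was mapped to 0/1:

```python
import pandas as pd

# Hypothetical stand-in for a few numeric columns of innhotel_df
# (booking_status already mapped to 0/1; values are illustrative only)
df = pd.DataFrame({
    'lead_time': [224, 5, 1, 211, 48, 300],
    'avg_price_per_room': [65.0, 106.7, 60.0, 100.0, 94.5, 120.0],
    'booking_status': [0, 0, 1, 1, 1, 1],
})

# Rank features by their correlation with the cancellation flag;
# numeric_only=True skips any remaining object columns on newer pandas
corr_with_target = (
    df.corr(numeric_only=True)['booking_status']
    .drop('booking_status')
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the full dataset the same pattern puts lead_time at the top (0.44), matching the heatmap.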
# Creating copy of dataset and converting few variables to category type
innhotel_df5 = innhotel_df.copy()
# Avg_price_per_room vs lead_time and booking_status
plt.figure(figsize=(20, 7))
sns.scatterplot(x='lead_time',y='avg_price_per_room',data=innhotel_df5,hue='booking_status')
# reservations made more than 150 days in advance
innhotel_df5[innhotel_df5['lead_time'] > 150]['booking_status'].value_counts(normalize=True)
1 0.716248 0 0.283752 Name: booking_status, dtype: float64
# reservations made less than 150 days in advance
innhotel_df5[innhotel_df5['lead_time'] < 150]['booking_status'].value_counts(normalize=True)
0 0.769859 1 0.230141 Name: booking_status, dtype: float64
As you can see, bookings made well in advance tend to have a lower average price per room, but cancellations are more frequent when the lead time is high. Reservations made more than 150 days in advance have a 72% cancellation rate, compared with only 23% for those made less than 150 days in advance.
# Avg_price_per_room vs arrival_month and year
# height sets each facet's size; plt.figure has no effect on catplot and leaves an empty figure
sns.catplot(x='arrival_month', y='avg_price_per_room', data=innhotel_df5, col='arrival_year', kind='point', height=7)
As you can see, rooms cost less in the early part of the year; prices rise during summer (months 5 to 9) and then start falling in the last 2 months. Rates in 2018 are noticeably higher than in 2017.
# Avg_price_per_room vs booking status
sns.catplot(x='booking_status', y='avg_price_per_room', data=innhotel_df5, kind='box', height=7)
As you can see, avg_price_per_room is slightly higher for canceled reservations than for non-canceled ones, but the difference is small.
# Total guest and booking status
innhotel_df5['total_guest'] = innhotel_df5['no_of_adults'] + innhotel_df5['no_of_children']
sns.catplot(x='total_guest', data=innhotel_df5, hue='booking_status', kind='count', height=7, aspect=2)
innhotel_df5.groupby('total_guest')['booking_status'].value_counts(normalize=True)
total_guest booking_status
1 0 0.760461
1 0.239539
2 0 0.654164
1 0.345836
3 0 0.638535
1 0.361465
4 0 0.563596
1 0.436404
5 0 0.666667
1 0.333333
10 0 1.000000
11 1 1.000000
12 0 1.000000
Name: booking_status, dtype: float64
As you can see, the cancellation proportion is about 24% for bookings with 1 guest, rising to roughly 35% for 2 or 3 guests and about 44% for 4 guests.
# Market_segment-type vs booking_Status
sns.countplot(x='market_segment_type',data=innhotel_df5,hue='booking_status')
innhotel_df5.groupby('market_segment_type')['booking_status'].value_counts(normalize=True)
market_segment_type booking_status
Aviation 0 0.704000
1 0.296000
Complementary 0 1.000000
Corporate 0 0.890927
1 0.109073
Offline 0 0.700513
1 0.299487
Online 0 0.634919
1 0.365081
Name: booking_status, dtype: float64
As you can see, the cancellation rate is highest for the Online market segment and very low for the Corporate and Complementary segments.
Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
# Market_segment-type vs avg_price_per_room
plt.figure(figsize=(20, 7))
sns.pointplot(x='market_segment_type',y='avg_price_per_room',data=innhotel_df5,hue='booking_status')
innhotel_df5.groupby('market_segment_type')['avg_price_per_room'].mean()
market_segment_type Aviation 100.704000 Complementary 3.141765 Corporate 82.911740 Offline 91.632679 Online 112.256855 Name: avg_price_per_room, dtype: float64
As you can see, avg_price_per_room is highest for the Online market segment (about 112.26 dollars), followed by Aviation (100.70 dollars). Leaving aside the Complementary segment, it is lowest for Corporate (about 82.91 dollars) and is 91.63 dollars for Offline. It is also observed that the price for canceled bookings is higher than for non-canceled ones across all market segments.
Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
display(innhotel_df5.groupby('repeated_guest')['booking_status'].value_counts())
display(innhotel_df5.groupby('repeated_guest')['booking_status'].value_counts(normalize=True))
ax = sns.countplot(x='repeated_guest',data=innhotel_df5[innhotel_df5['repeated_guest'] == 1],hue='booking_status')
barplot_values_percent(ax)
repeated_guest booking_status
0 0 23476
1 11869
1 0 914
1 16
Name: booking_status, dtype: int64
repeated_guest booking_status
0 0 0.664196
1 0.335804
1 0 0.982796
1 0.017204
Name: booking_status, dtype: float64
As you can see, repeat guests cancel only 1.72% of their reservations; 98.28% are not canceled.
Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?
sns.countplot(x='no_of_special_requests',data=innhotel_df5,hue='booking_status')
display(innhotel_df5.groupby('no_of_special_requests')['booking_status'].value_counts())
display(innhotel_df5.groupby('no_of_special_requests')['booking_status'].value_counts(normalize=True))
no_of_special_requests booking_status
0 0 11232
1 8545
1 0 8670
1 2703
2 0 3727
1 637
3 0 675
4 0 78
5 0 8
Name: booking_status, dtype: int64
no_of_special_requests booking_status
0 0 0.567932
1 0.432068
1 0 0.762332
1 0.237668
2 0 0.854033
1 0.145967
3 0 1.000000
4 0 1.000000
5 0 1.000000
Name: booking_status, dtype: float64
As you can see, the chance of cancellation goes down as customers make more special requests. Customers with no special requests have a cancellation rate of 43.2%; with just 1 request it drops to about 23.8%. With 3, 4, or 5 requests, the cancellation rate is 0.
innhotel_df.isna().sum()
no_of_adults 0 no_of_children 0 no_of_weekend_nights 0 no_of_week_nights 0 type_of_meal_plan 0 required_car_parking_space 0 room_type_reserved 0 lead_time 0 arrival_year 0 arrival_month 0 arrival_date 0 market_segment_type 0 repeated_guest 0 no_of_previous_cancellations 0 no_of_previous_bookings_not_canceled 0 avg_price_per_room 0 no_of_special_requests 0 booking_status 0 dtype: int64
There are no missing values
# Checking that no reservation has both no_of_adults and no_of_children equal to 0
no_peoplemiss = innhotel_df.query("no_of_adults < 0.5 and no_of_children < 0.5")
no_peoplemiss.shape
(0, 18)
No reservation has both the number of adults and the number of children equal to 0.
# Avg price of the room has 0 so we need to treat those
innhotel_df[innhotel_df['avg_price_per_room'] <= 0].shape
(545, 18)
There are 545 rows with an average room price of 0, so we will replace those values with the median price for the dataset.
# Replacing 0 avg_price_per_room with median avg_price_per_room
innhotel_df.loc[innhotel_df['avg_price_per_room'] <= 0, 'avg_price_per_room'] = innhotel_df['avg_price_per_room'].median()
# Checking to make sure no data has avg_price_per_room = 0
innhotel_df[innhotel_df['avg_price_per_room'] <= 0].shape
(0, 18)
# outlier detection using boxplot
# selecting the numerical columns of data and adding their names in a list
numeric_columns = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'required_car_parking_space',
'lead_time', 'arrival_year', 'arrival_month',
'arrival_date', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'avg_price_per_room', 'no_of_special_requests']
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(innhotel_df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# to find the 25th percentile and 75th percentile for the numerical columns.
Q1 = innhotel_df[numeric_columns].quantile(0.25)
Q3 = innhotel_df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1 #Inter Quantile Range (75th percentile - 25th percentile)
lower_whisker = Q1 - 1.5*IQR #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
# Percentage of outliers in each column
((innhotel_df[numeric_columns] < lower_whisker) | (innhotel_df[numeric_columns] > upper_whisker)).sum()/innhotel_df.shape[0]*100
no_of_adults 28.027567 no_of_children 7.437629 no_of_weekend_nights 0.057891 no_of_week_nights 0.893177 required_car_parking_space 3.098553 lead_time 3.669194 arrival_year 17.957271 arrival_month 0.000000 arrival_date 0.000000 repeated_guest 2.563749 no_of_previous_cancellations 0.931771 no_of_previous_bookings_not_canceled 2.238456 avg_price_per_room 3.244659 no_of_special_requests 2.097864 dtype: float64
Most of the outliers in the data look normal, except for no_of_children greater than 3 and avg_price_per_room greater than 300 dollars. We will check how many data points have an average room price above 300 dollars or more than 3 children.
# Checking how many data points have avg_price_per_room > 300
innhotel_df[innhotel_df['avg_price_per_room'] > 300].shape
(9, 18)
# Checking how many data points have no_of_children > 3
innhotel_df[innhotel_df['no_of_children'] > 3].shape
(3, 18)
As we can see, only 9 data points have avg_price_per_room > 300 and 3 data points have no_of_children > 3. Such values are plausible, so we do not have to treat these outliers.
# Creating Dummy variable for all object datatype in database
innhotel_df3 = pd.get_dummies(innhotel_df, columns=['type_of_meal_plan', 'room_type_reserved', 'market_segment_type'], drop_first=True)
innhotel_df3.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36275 entries, 0 to 36274 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 no_of_adults 36275 non-null int64 1 no_of_children 36275 non-null int64 2 no_of_weekend_nights 36275 non-null int64 3 no_of_week_nights 36275 non-null int64 4 required_car_parking_space 36275 non-null int64 5 lead_time 36275 non-null int64 6 arrival_year 36275 non-null int64 7 arrival_month 36275 non-null int64 8 arrival_date 36275 non-null int64 9 repeated_guest 36275 non-null int64 10 no_of_previous_cancellations 36275 non-null int64 11 no_of_previous_bookings_not_canceled 36275 non-null int64 12 avg_price_per_room 36275 non-null float64 13 no_of_special_requests 36275 non-null int64 14 booking_status 36275 non-null int64 15 type_of_meal_plan_Meal Plan 2 36275 non-null uint8 16 type_of_meal_plan_Meal Plan 3 36275 non-null uint8 17 type_of_meal_plan_Not Selected 36275 non-null uint8 18 room_type_reserved_Room_Type 2 36275 non-null uint8 19 room_type_reserved_Room_Type 3 36275 non-null uint8 20 room_type_reserved_Room_Type 4 36275 non-null uint8 21 room_type_reserved_Room_Type 5 36275 non-null uint8 22 room_type_reserved_Room_Type 6 36275 non-null uint8 23 room_type_reserved_Room_Type 7 36275 non-null uint8 24 market_segment_type_Complementary 36275 non-null uint8 25 market_segment_type_Corporate 36275 non-null uint8 26 market_segment_type_Offline 36275 non-null uint8 27 market_segment_type_Online 36275 non-null uint8 dtypes: float64(1), int64(14), uint8(13) memory usage: 4.6 MB
Model evaluation criterion
The model can make wrong predictions in two ways:
1) Predicting that a customer will cancel the booking when in reality they do not. In this case the hotel may not arrange enough manpower and resources for the guest, and so cannot provide good service.
2) Predicting that a customer will not cancel the booking when in reality they do. In this case the hotel loses revenue, and ends up with a lower profit margin if it tries to resell the room at the last minute at a lower price.
Both cases are important to us, so we want to maximize the F1 score, since both false negatives and false positives matter in our case.
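The choice of F1 over accuracy can be illustrated with a small sketch on toy labels (hypothetical values, not the hotel data): accuracy can look acceptable while half the cancellations are missed, whereas F1 is pulled down by both false positives and false negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Hypothetical labels: 1 = Canceled, 0 = Not_Canceled
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # 2 FN (lost revenue), 1 FP (wasted resources)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                  # 5 1 2 2
print(accuracy_score(y_true, y_pred))  # 0.7 despite missing half the cancellations
print(f1_score(y_true, y_pred))        # 4/7 ≈ 0.571, penalized by both error types
```

Here precision is 2/3 and recall is 1/2, so F1 = 2·(2/3)·(1/2) / (2/3 + 1/2) = 4/7, reflecting both kinds of error at once.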
# independent variables
X = innhotel_df3.drop(["booking_status"], axis=1)
# dependent variable
y = innhotel_df3[["booking_status"]]
# this adds the constant term to the dataset
X = sm.add_constant(X)
# Spliting data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1,stratify=y
)
# Fitting the model
logit = sm.Logit(y_train, X_train)
lg = logit.fit()
Optimization terminated successfully.
Current function value: 0.423894
Iterations 26
# let's print the logistic regression summary
print(lg.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25364
Method: MLE Df Model: 27
Date: Thu, 02 Nov 2023 Pseudo R-squ.: 0.3298
Time: 22:14:01 Log-Likelihood: -10764.
converged: True LL-Null: -16060.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -952.0595 121.184 -7.856 0.000 -1189.575 -714.544
no_of_adults 0.0538 0.038 1.432 0.152 -0.020 0.127
no_of_children 0.0973 0.060 1.609 0.108 -0.021 0.216
no_of_weekend_nights 0.1483 0.020 7.484 0.000 0.109 0.187
no_of_week_nights 0.0375 0.012 3.062 0.002 0.014 0.062
required_car_parking_space -1.5989 0.137 -11.663 0.000 -1.868 -1.330
lead_time 0.0157 0.000 58.817 0.000 0.015 0.016
arrival_year 0.4705 0.060 7.834 0.000 0.353 0.588
arrival_month -0.0465 0.006 -7.178 0.000 -0.059 -0.034
arrival_date 0.0030 0.002 1.545 0.122 -0.001 0.007
repeated_guest -1.9350 0.758 -2.553 0.011 -3.420 -0.450
no_of_previous_cancellations 0.3472 0.102 3.413 0.001 0.148 0.547
no_of_previous_bookings_not_canceled -1.3735 0.903 -1.522 0.128 -3.143 0.396
avg_price_per_room 0.0179 0.001 23.660 0.000 0.016 0.019
no_of_special_requests -1.4826 0.030 -48.831 0.000 -1.542 -1.423
type_of_meal_plan_Meal Plan 2 0.1633 0.067 2.443 0.015 0.032 0.294
type_of_meal_plan_Meal Plan 3 32.2368 6.52e+06 4.95e-06 1.000 -1.28e+07 1.28e+07
type_of_meal_plan_Not Selected 0.2014 0.053 3.781 0.000 0.097 0.306
room_type_reserved_Room_Type 2 -0.4188 0.133 -3.154 0.002 -0.679 -0.159
room_type_reserved_Room_Type 3 1.2013 1.891 0.635 0.525 -2.506 4.908
room_type_reserved_Room_Type 4 -0.2534 0.053 -4.752 0.000 -0.358 -0.149
room_type_reserved_Room_Type 5 -0.6678 0.214 -3.114 0.002 -1.088 -0.247
room_type_reserved_Room_Type 6 -0.8170 0.153 -5.350 0.000 -1.116 -0.518
room_type_reserved_Room_Type 7 -1.3282 0.298 -4.462 0.000 -1.912 -0.745
market_segment_type_Complementary -90.2612 2.74e+13 -3.29e-12 1.000 -5.38e+13 5.38e+13
market_segment_type_Corporate -0.8404 0.276 -3.046 0.002 -1.381 -0.300
market_segment_type_Offline -1.7530 0.264 -6.642 0.000 -2.270 -1.236
market_segment_type_Online -0.0095 0.261 -0.037 0.971 -0.521 0.502
========================================================================================================
# Model performance evaluation
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Function to evaluate model performance
def score(model, train, act, desc, n):
    """
    Evaluates and prints model performance.
    Inputs:
    model: fitted model used to generate predictions
    train: feature set (training or test) to predict on
    act: actual target values from the dataset
    desc: label ('Training' or 'Test') used when printing
    n: probability threshold for classifying a booking as canceled
    Outputs:
    DataFrame with Recall, Precision, Accuracy, and F1 score
    """
    # converting predicted probabilities into class labels using the threshold
    pred = (model.predict(train) > n).astype(int)
    pc_test = precision_score(act, pred)
    print("The precision score is {pc:.3f}".format(pc=pc_test))
    rc_test = recall_score(act, pred)
    print("The recall score is {rc:.3f}".format(rc=rc_test))
    ac_test = accuracy_score(act, pred)
    print("The accuracy score is {ac:.3f}".format(ac=ac_test))
    f1_test = f1_score(act, pred)
    print("The F1 score is {f1:.3f}".format(f1=f1_test))

    # plotting the confusion matrix
    cm = confusion_matrix(act, pred)
    plt.figure(figsize=(7, 5))
    sns.heatmap(cm, annot=True, fmt="g")
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.show()

    # collecting the results in a dataframe
    df_pred = pd.DataFrame()
    df_pred["Recall"] = [rc_test]
    df_pred["Precision"] = [pc_test]
    df_pred["Accuracy"] = [ac_test]
    df_pred["F1_score"] = [f1_test]
    print("Results for the", desc, "model:", "\n")
    return df_pred
# Evaluating training model performance
log_reg_model_train_perf = score(lg,X_train,y_train,'Training',0.5)
log_reg_model_train_perf
The precision score is 0.739
The recall score is 0.630
The accuracy score is 0.806
The F1 score is 0.680
Results for the Training model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.630364 | 0.739112 | 0.806002 | 0.680420 |
# Evaluating testing model performance
log_reg_model_test_perf = score(lg,X_test,y_test,'Test',0.5)
log_reg_model_test_perf
The precision score is 0.735
The recall score is 0.624
The accuracy score is 0.803
The F1 score is 0.675
Results for the Test model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.623948 | 0.734808 | 0.802995 | 0.674856 |
# let's check the VIF of the predictors
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_series = pd.Series(
    [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])],
    index=X_train.columns,
    dtype=float,
)
print("VIF values: \n\n{}\n".format(vif_series.sort_values()))
VIF values:

room_type_reserved_Room_Type 3          1.003920e+00
arrival_date                            1.007679e+00
type_of_meal_plan_Meal Plan 3           1.008016e+00
room_type_reserved_Room_Type 5          1.032550e+00
required_car_parking_space              1.034472e+00
no_of_weekend_nights                    1.070613e+00
room_type_reserved_Room_Type 2          1.095021e+00
no_of_week_nights                       1.096991e+00
room_type_reserved_Room_Type 7          1.097799e+00
no_of_special_requests                  1.246800e+00
type_of_meal_plan_Meal Plan 2           1.283243e+00
arrival_month                           1.283548e+00
type_of_meal_plan_Not Selected          1.286599e+00
no_of_previous_cancellations            1.322032e+00
no_of_adults                            1.339355e+00
room_type_reserved_Room_Type 4          1.361858e+00
lead_time                               1.408643e+00
arrival_year                            1.429625e+00
no_of_previous_bookings_not_canceled    1.570753e+00
repeated_guest                          1.749581e+00
avg_price_per_room                      1.953082e+00
no_of_children                          2.005270e+00
room_type_reserved_Room_Type 6          2.008492e+00
market_segment_type_Complementary       4.175391e+00
market_segment_type_Corporate           1.663647e+01
market_segment_type_Offline             6.251368e+01
market_segment_type_Online              6.949159e+01
const                                   3.949439e+07
dtype: float64
The dummy variables for market_segment_type show multicollinearity, but since they are indicator variables derived from a single categorical column, we can ignore this for the time being and check the p-values instead.
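As a reminder of what VIF flags, here is a small sketch on synthetic predictors (hypothetical data, not the hotel set): a variable that is nearly a linear combination of another gets a large VIF, while an independent one stays near 1.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)                          # independent predictor

X_toy = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})
vifs = pd.Series(
    [variance_inflation_factor(X_toy.values, i) for i in range(X_toy.shape[1])],
    index=X_toy.columns,
)
print(vifs)  # x1 and x2 get large VIFs; x3 stays near 1
```

Dummy variables from one categorical behave like x1/x2 here: knowing all but one level almost determines the rest, so their VIFs inflate without the coefficients being unreliable in the same way as genuinely redundant measurements.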
# initial list of columns
predictors = X_train.copy()
cols = predictors.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = predictors[cols]

    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
Optimization terminated successfully.
Current function value: 0.423894
Iterations 26
Optimization terminated successfully.
Current function value: 0.424666
Iterations 16
Optimization terminated successfully.
Current function value: 0.424668
Iterations 16
Optimization terminated successfully.
Current function value: 0.424677
Iterations 16
Optimization terminated successfully.
Current function value: 0.424704
Iterations 16
Optimization terminated successfully.
Current function value: 0.424742
Iterations 16
Optimization terminated successfully.
Current function value: 0.424782
Iterations 16
Optimization terminated successfully.
Current function value: 0.424949
Iterations 11
Optimization terminated successfully.
Current function value: 0.424998
Iterations 11
['const', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Offline', 'market_segment_type_Online']
# Taking available features after removing all columns with high p values
X_train3 = X_train[selected_features]
X_test3 = X_test[selected_features]
# Fitting the model
logit3 = sm.Logit(y_train, X_train3)
lg3 = logit3.fit()
Optimization terminated successfully.
Current function value: 0.424998
Iterations 11
# let's print the logistic regression summary
print(lg3.summary())
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 25392
Model: Logit Df Residuals: 25372
Method: MLE Df Model: 19
Date: Thu, 02 Nov 2023 Pseudo R-squ.: 0.3280
Time: 22:14:08 Log-Likelihood: -10792.
converged: True LL-Null: -16060.
Covariance Type: nonrobust LLR p-value: 0.000
==================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -963.8178 120.702 -7.985 0.000 -1200.389 -727.247
no_of_weekend_nights 0.1538 0.020 7.781 0.000 0.115 0.193
no_of_week_nights 0.0401 0.012 3.278 0.001 0.016 0.064
required_car_parking_space -1.5913 0.137 -11.617 0.000 -1.860 -1.323
lead_time 0.0158 0.000 59.383 0.000 0.015 0.016
arrival_year 0.4759 0.060 7.956 0.000 0.359 0.593
arrival_month -0.0474 0.006 -7.333 0.000 -0.060 -0.035
repeated_guest -3.0650 0.594 -5.159 0.000 -4.229 -1.901
no_of_previous_cancellations 0.2871 0.078 3.702 0.000 0.135 0.439
avg_price_per_room 0.0182 0.001 24.554 0.000 0.017 0.020
no_of_special_requests -1.4788 0.030 -49.144 0.000 -1.538 -1.420
type_of_meal_plan_Meal Plan 2 0.1653 0.067 2.476 0.013 0.034 0.296
type_of_meal_plan_Not Selected 0.2042 0.053 3.858 0.000 0.100 0.308
room_type_reserved_Room_Type 2 -0.3795 0.129 -2.953 0.003 -0.631 -0.128
room_type_reserved_Room_Type 4 -0.2361 0.052 -4.578 0.000 -0.337 -0.135
room_type_reserved_Room_Type 5 -0.6578 0.213 -3.083 0.002 -1.076 -0.240
room_type_reserved_Room_Type 6 -0.6696 0.120 -5.558 0.000 -0.906 -0.433
room_type_reserved_Room_Type 7 -1.2378 0.291 -4.251 0.000 -1.808 -0.667
market_segment_type_Offline -0.8637 0.100 -8.607 0.000 -1.060 -0.667
market_segment_type_Online 0.8874 0.095 9.295 0.000 0.700 1.075
==================================================================================================
# Evaluating training model performance
log_reg_model_train_perf = score(lg3,X_train3,y_train,'Training',0.5)
log_reg_model_train_perf
The precision score is 0.738
The recall score is 0.630
The accuracy score is 0.806
The F1 score is 0.680
Results for the Training model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.629763 | 0.738303 | 0.805569 | 0.679728 |
# Evaluating testing model performance
log_reg_model_test_perf = score(lg3,X_test3,y_test,'Test',0.5)
log_reg_model_test_perf
The precision score is 0.733
The recall score is 0.623
The accuracy score is 0.802
The F1 score is 0.674
Results for the Test model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.623388 | 0.732938 | 0.802169 | 0.673738 |
Converting coefficients to odds
The coefficients of the logistic regression model are in terms of log(odds); to find the odds we take the exponential of each coefficient.
Therefore, odds = exp(b). The percentage change in odds is given as: change in odds% = (exp(b) - 1) * 100
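As a quick worked example, taking the lead_time coefficient of roughly 0.0158 from the summary above:

```python
import numpy as np

b = 0.0158                          # log-odds change per extra day of lead time
odds_ratio = np.exp(b)              # odds multiplier per extra day
pct_change = (np.exp(b) - 1) * 100  # percentage change in odds
print(odds_ratio, pct_change)       # ~1.016, i.e. about a 1.6% increase per day
```

Each additional day of lead time multiplies the cancellation odds by about 1.016, which compounds quickly over long booking horizons.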
# converting coefficients to odds
odds = np.exp(lg3.params)
# finding the percentage change
perc_change_odds = (np.exp(lg3.params) - 1) * 100
# removing limit from number of columns to display
pd.set_option("display.max_columns", None)
# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train3.columns).T
| | const | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | repeated_guest | no_of_previous_cancellations | avg_price_per_room | no_of_special_requests | type_of_meal_plan_Meal Plan 2 | type_of_meal_plan_Not Selected | room_type_reserved_Room_Type 2 | room_type_reserved_Room_Type 4 | room_type_reserved_Room_Type 5 | room_type_reserved_Room_Type 6 | room_type_reserved_Room_Type 7 | market_segment_type_Offline | market_segment_type_Online |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Odds | 0.0 | 1.166227 | 1.040926 | 0.203656 | 1.015886 | 1.609466 | 0.953728 | 0.046652 | 1.332599 | 1.018405 | 0.227918 | 1.179792 | 1.226595 | 0.684213 | 0.789676 | 0.517972 | 0.511895 | 0.290028 | 0.421580 | 2.428786 |
| Change_odd% | -100.0 | 16.622697 | 4.092632 | -79.634365 | 1.588584 | 60.946585 | -4.627241 | -95.334836 | 33.259873 | 1.840518 | -77.208169 | 17.979158 | 22.659467 | -31.578671 | -21.032363 | -48.202750 | -48.810502 | -70.997222 | -57.841969 | 142.878621 |
Coefficient Interpretation:
The coefficients of no_of_weekend_nights, no_of_week_nights, lead_time, arrival_year, no_of_previous_cancellations, avg_price_per_room, some levels of meal plan, and some levels of market segment are positive: an increase in any of these increases the chances of a guest canceling the reservation.
The coefficients of required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests, some levels of room type, and some levels of market segment are negative: an increase in any of these decreases the chances of a guest canceling the reservation.
# Receiver operating characteristic curve
from sklearn.metrics import roc_auc_score, roc_curve

logit_roc_auc_train = roc_auc_score(y_train, lg3.predict(X_train3))
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg3.predict(X_train3))
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3273550051135727
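The argmax(tpr - fpr) rule used above is Youden's J statistic. A self-contained sketch on synthetic scores (not the hotel data) showing the same recipe:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(1)
# synthetic predicted probabilities: cancellations tend to score higher
y_toy = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.normal(0.7, 0.15, 500), rng.normal(0.3, 0.15, 500)])

fpr_t, tpr_t, thr_t = roc_curve(y_toy, scores)
j = tpr_t - fpr_t             # Youden's J at each candidate threshold
best = thr_t[np.argmax(j)]    # threshold maximizing TPR - FPR
print(best)                   # close to the midpoint of the two score clusters
```

On well-separated clusters the optimum sits near where the two score distributions cross; on the hotel data the same computation landed at about 0.327 because the predicted probabilities are skewed toward the non-canceled class.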
# Evaluating training model performance with optimal threshold from AUC-ROC curve
log_reg_model_train_perf_threshold_auc_roc = score(lg3,X_train3,y_train,'Training',0.3274)
log_reg_model_train_perf_threshold_auc_roc
The precision score is 0.640
The recall score is 0.768
The accuracy score is 0.783
The F1 score is 0.699
Results for the Training model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.768362 | 0.640481 | 0.782806 | 0.698617 |
# Evaluating testing model performance
log_reg_model_test_perf_threshold_auc_roc = score(lg3,X_test3,y_test,'Testing',0.3274)
log_reg_model_test_perf_threshold_auc_roc
The precision score is 0.632
The recall score is 0.769
The accuracy score is 0.777
The F1 score is 0.694
Results for the Testing model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.769209 | 0.631737 | 0.777451 | 0.693728 |
from sklearn.metrics import precision_recall_curve

y_scores = lg3.predict(X_train3)
prec, rec, tre = precision_recall_curve(y_train, y_scores)

def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
We can use a threshold of 0.42 as per the precision-recall curve. Let's use that and check the model performance.
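The 0.42 above was read off the plot; the crossover can also be found programmatically by picking the threshold where precision and recall are closest. A sketch on synthetic scores (hypothetical data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(2)
y_toy = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.beta(5, 2, 500), rng.beta(2, 5, 500)])

prec_t, rec_t, thr_t = precision_recall_curve(y_toy, scores)
# precision/recall arrays have one more element than thresholds; drop the last point
gap = np.abs(prec_t[:-1] - rec_t[:-1])
crossover = thr_t[np.argmin(gap)]  # threshold where precision and recall meet
print(crossover)
```

At the crossover point the two error rates are traded off evenly, which is why it tends to sit close to the threshold that maximizes F1.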
# Evaluating training model performance
log_reg_model_train_perf_threshold_prcurve = score(lg3,X_train3,y_train,'Training',0.42)
log_reg_model_train_perf_threshold_prcurve
The precision score is 0.699
The recall score is 0.698
The accuracy score is 0.802
The F1 score is 0.698
Results for the Training model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.697560 | 0.698904 | 0.802457 | 0.698231 |
# Evaluating testing model performance
log_reg_model_test_perf_threshold_prcurve = score(lg3,X_test3,y_test,'Testing',0.42)
log_reg_model_test_perf_threshold_prcurve
The precision score is 0.690
The recall score is 0.694
The accuracy score is 0.798
The F1 score is 0.692
Results for the Testing model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.694335 | 0.690078 | 0.797666 | 0.692200 |
# training performance comparison
models_train_comp_df = pd.concat(
[
log_reg_model_train_perf.T,
log_reg_model_train_perf_threshold_auc_roc.T,
log_reg_model_train_perf_threshold_prcurve.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression-default Threshold",
"Logistic Regression-0.327 Threshold",
"Logistic Regression-0.42 Threshold",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.327 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Recall | 0.629763 | 0.768362 | 0.697560 |
| Precision | 0.738303 | 0.640481 | 0.698904 |
| Accuracy | 0.805569 | 0.782806 | 0.802457 |
| F1_score | 0.679728 | 0.698617 | 0.698231 |
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_prcurve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.327 Threshold",
    "Logistic Regression-0.42 Threshold",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Logistic Regression-default Threshold | Logistic Regression-0.327 Threshold | Logistic Regression-0.42 Threshold |
|---|---|---|---|
| Recall | 0.623388 | 0.769209 | 0.694335 |
| Precision | 0.732938 | 0.631737 | 0.690078 |
| Accuracy | 0.802169 | 0.777451 | 0.797666 |
| F1_score | 0.673738 | 0.693728 | 0.692200 |
For decision trees, we don't have to worry about multicollinearity or outliers.
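To illustrate the robustness to outliers, a small sketch on synthetic data (not the hotel set): applying a monotonic transform that creates extreme outliers in one feature leaves a fully grown tree's fitted predictions unchanged, because splits depend only on the ordering of feature values.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

t1 = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

# a monotonic transform that creates extreme outliers in the third feature
X_out = X_toy.copy()
X_out[:, 2] = np.exp(5 * X_out[:, 2])
t2 = DecisionTreeClassifier(random_state=1).fit(X_out, y_toy)

# splits depend only on the ordering of values, so the two fits agree
agreement = (t1.predict(X_toy) == t2.predict(X_out)).mean()
print(agreement)
```

This is why, unlike for logistic regression, we can skip outlier treatment and VIF checks before fitting the tree.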
# Creating Dummy variable for all object datatype in database without dropping first
innhotel_df4 = pd.get_dummies(innhotel_df, columns=['type_of_meal_plan', 'room_type_reserved', 'market_segment_type'])
# independent variables
X = innhotel_df4.drop(["booking_status"], axis=1)
# dependent variable
y = innhotel_df4[["booking_status"]]
# Splitting data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
# Creating a model and fitting to data
from sklearn.tree import DecisionTreeClassifier

dTree = DecisionTreeClassifier(criterion="gini", random_state=1)
dTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Scoring on model
print("Accuracy on training set : ",dTree.score(X_train, y_train))
print("Accuracy on test set : ",dTree.score(X_test, y_test))
Accuracy on training set :  0.994210775047259
Accuracy on test set :  0.8704401359919139
# Function to evaluate model performance
def score(model, train, act, desc):
    """
    Evaluates and prints model performance.
    Inputs:
    model: fitted model used to generate predictions
    train: feature set (training or test) to predict on
    act: actual target values from the dataset
    desc: label ('Training' or 'Testing') used when printing
    Outputs:
    DataFrame with Recall, Precision, Accuracy, and F1 score
    """
    pred = model.predict(train)
    pc_test = precision_score(act, pred)
    print("The precision score is {pc:.3f}".format(pc=pc_test))
    rc_test = recall_score(act, pred)
    print("The recall score is {rc:.3f}".format(rc=rc_test))
    ac_test = accuracy_score(act, pred)
    print("The accuracy score is {ac:.3f}".format(ac=ac_test))
    f1_test = f1_score(act, pred)
    print("The F1 score is {f1:.3f}".format(f1=f1_test))

    # plotting the confusion matrix
    cm = confusion_matrix(act, pred)
    plt.figure(figsize=(7, 5))
    sns.heatmap(cm, annot=True, fmt="g")
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.show()

    # collecting the results in a dataframe
    df_pred = pd.DataFrame()
    df_pred["Recall"] = [rc_test]
    df_pred["Precision"] = [pc_test]
    df_pred["Accuracy"] = [ac_test]
    df_pred["F1_score"] = [f1_test]
    print("Results for the", desc, "model:", "\n")
    return df_pred
# Evaluating training model performance
decisiontree_train_perf = score(dTree,X_train,y_train,'Training')
decisiontree_train_perf
The precision score is 0.996
The recall score is 0.987
The accuracy score is 0.994
The F1 score is 0.991
Results for the Training model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.986608 | 0.995776 | 0.994211 | 0.991171 |
# Evaluating testing model performance
decisiontree_test_perf = score(dTree,X_test,y_test,'Testing')
decisiontree_test_perf
The precision score is 0.797
The recall score is 0.804
The accuracy score is 0.870
The F1 score is 0.801
Results for the Testing model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.804373 | 0.797130 | 0.870440 | 0.800735 |
feature_names = list(X.columns)
print(feature_names)
['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'repeated_guest', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 1', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Meal Plan 3', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 1', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 3', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Aviation', 'market_segment_type_Complementary', 'market_segment_type_Corporate', 'market_segment_type_Offline', 'market_segment_type_Online']
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# importance of features in the tree building
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                           Imp
lead_time                             0.347194
avg_price_per_room                    0.179874
market_segment_type_Online            0.094819
arrival_date                          0.081914
no_of_special_requests                0.068043
arrival_month                         0.064088
no_of_week_nights                     0.045813
no_of_weekend_nights                  0.037850
no_of_adults                          0.025906
arrival_year                          0.012276
required_car_parking_space            0.007353
type_of_meal_plan_Meal Plan 1         0.006648
room_type_reserved_Room_Type 4        0.005596
room_type_reserved_Room_Type 1        0.004630
type_of_meal_plan_Not Selected        0.003729
no_of_children                        0.003581
type_of_meal_plan_Meal Plan 2         0.002102
room_type_reserved_Room_Type 2        0.002022
room_type_reserved_Room_Type 5        0.001631
market_segment_type_Offline           0.001287
market_segment_type_Aviation          0.000759
room_type_reserved_Room_Type 7        0.000682
room_type_reserved_Room_Type 6        0.000669
market_segment_type_Corporate         0.000515
repeated_guest                        0.000483
no_of_previous_bookings_not_canceled  0.000371
no_of_previous_cancellations          0.000091
market_segment_type_Complementary     0.000075
type_of_meal_plan_Meal Plan 3         0.000000
room_type_reserved_Room_Type 3        0.000000
As we can see, 'lead_time', 'avg_price_per_room', 'market_segment_type_Online', and 'arrival_date' are among the most important features, together contributing almost 70% of the decision tree's predictive importance.
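The "almost 70%" figure comes from summing the sorted importances; a self-contained sketch of that check on synthetic data (the real check would use dTree.feature_importances_ from above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the hotel data
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, n_informative=3, random_state=1
)
toy_tree = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)

imp = np.sort(toy_tree.feature_importances_)[::-1]  # descending importances
cum = np.cumsum(imp)                                # cumulative share of total
print(cum[:4])                                      # share captured by the top 4 features
```

Because the importances are normalized to sum to 1, the cumulative sum of the top features directly reads off their combined contribution.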
The tree above is very complex and is overfitting the training data, so we need to prune it.
Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the model's loss, so we usually resort to experimentation, i.e., grid search. Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
It is an exhaustive search performed over the specified parameter values of a model.
The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(2, 7, 2),
"max_leaf_nodes": [50, 75, 150, 250],
"min_samples_split": [10, 30, 50, 70],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
}
# Type of scoring used to compare parameter combinations
from sklearn.metrics import make_scorer
acc_scorer = make_scorer(f1_score)

# Run the grid search
from sklearn.model_selection import GridSearchCV
grid_obj = GridSearchCV(estimator, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=6, max_leaf_nodes=50,
                       min_samples_split=10, random_state=1)
# Evaluating training model performance
decisiontree_train_perf_prune = score(estimator,X_train,y_train,'Training')
decisiontree_train_perf_prune
The precision score is 0.724
The recall score is 0.786
The accuracy score is 0.831
The F1 score is 0.754
Results for the Training model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.785962 | 0.724058 | 0.830852 | 0.753741 |
# Evaluating training model performance
decisiontree_test_perf_prune = score(estimator,X_test,y_test,'Testing')
decisiontree_test_perf_prune
The precision score is 0.728
The recall score is 0.783
The accuracy score is 0.835
The F1 score is 0.754
Results for the Testing model:

| | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.783078 | 0.727513 | 0.834880 | 0.754273 |
from sklearn import tree

plt.figure(figsize=(15, 10))
tree.plot_tree(estimator, feature_names=feature_names, filled=True, fontsize=9, node_ids=True, class_names=True)
plt.show()
# importance of features in the tree building (the importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature; also known as the Gini importance)
print(pd.DataFrame(estimator.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False))
# Note that the relative importance of the top features has increased
                                           Imp
lead_time                             0.475993
market_segment_type_Online            0.184770
no_of_special_requests                0.169335
avg_price_per_room                    0.075421
no_of_adults                          0.026944
no_of_weekend_nights                  0.020608
arrival_month                         0.014138
required_car_parking_space            0.014114
market_segment_type_Offline           0.009958
no_of_week_nights                     0.007006
type_of_meal_plan_Not Selected        0.000950
arrival_date                          0.000761
no_of_previous_bookings_not_canceled  0.000000
room_type_reserved_Room_Type 4        0.000000
market_segment_type_Corporate         0.000000
market_segment_type_Complementary     0.000000
market_segment_type_Aviation          0.000000
room_type_reserved_Room_Type 7        0.000000
room_type_reserved_Room_Type 6        0.000000
room_type_reserved_Room_Type 5        0.000000
room_type_reserved_Room_Type 3        0.000000
no_of_previous_cancellations          0.000000
room_type_reserved_Room_Type 2        0.000000
room_type_reserved_Room_Type 1        0.000000
arrival_year                          0.000000
type_of_meal_plan_Meal Plan 3         0.000000
no_of_children                        0.000000
type_of_meal_plan_Meal Plan 1         0.000000
repeated_guest                        0.000000
type_of_meal_plan_Meal Plan 2         0.000000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000e+00 | 0.007572 |
| 1 | 4.327745e-07 | 0.007573 |
| 2 | 4.688391e-07 | 0.007573 |
| 3 | 5.329960e-07 | 0.007574 |
| 4 | 6.133547e-07 | 0.007575 |
| ... | ... | ... |
| 1342 | 6.665684e-03 | 0.286897 |
| 1343 | 1.304480e-02 | 0.299942 |
| 1344 | 1.725993e-02 | 0.317202 |
| 1345 | 2.399048e-02 | 0.365183 |
| 1346 | 7.657789e-02 | 0.441761 |
1347 rows × 2 columns
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.07657789477371374
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Accuracy vs alpha for training and testing sets
When ccp_alpha is set to zero, keeping the other parameters of DecisionTreeClassifier at their defaults, the tree overfits, leading to nearly 100% training accuracy and 87% testing accuracy. As alpha increases, more of the tree is pruned, creating a decision tree that generalizes better.
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, y_train))
print('Test accuracy of best model: ',best_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.00012004617212610687, random_state=1)
Training accuracy of best model:  0.9018982356647763
Test accuracy of best model:  0.8833961223927226
from sklearn import metrics

f1_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    f1_train.append(metrics.f1_score(y_train, pred_train3))

f1_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    f1_test.append(metrics.f1_score(y_test, pred_test3))
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("f1score")
ax.set_title("f1 vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# selecting the model with the highest test F1 score
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0001350881814381269, random_state=1)
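The `score()` helper used below is defined earlier in the notebook; this is a minimal sketch of what such a helper plausibly looks like, given the printed output and the one-row tables it produces. It computes the four metrics, prints them, and returns them as a one-row DataFrame so results can later be concatenated into the model-comparison tables.

```python
# Hypothetical reconstruction of the notebook's score() helper.
import pandas as pd
from sklearn import metrics

def score(model, X, y, label):
    """Print precision/recall/accuracy/F1 and return them as a one-row DataFrame."""
    pred = model.predict(X)
    precision = metrics.precision_score(y, pred)
    recall = metrics.recall_score(y, pred)
    accuracy = metrics.accuracy_score(y, pred)
    f1 = metrics.f1_score(y, pred)
    print(f"The precision score is {precision:.3f}")
    print(f"The recall score is {recall:.3f}")
    print(f"The accuracy score is {accuracy:.3f}")
    print(f"The F1 score is {f1:.3f}")
    print(f"Results for the {label} model are:")
    return pd.DataFrame(
        {"Recall": [recall], "Precision": [precision],
         "Accuracy": [accuracy], "F1_score": [f1]}
    )
```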
Confusion Matrix - post-pruned decision tree
# Evaluating training model performance
decisiontree_train_perf_postprune = score(best_model,X_train,y_train,'Training')
decisiontree_train_perf_postprune
The precision score is 0.858
The recall score is 0.822
The accuracy score is 0.897
The F1 score is 0.840
Results for the Training model are:
|   | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.822193 | 0.857784 | 0.896542 | 0.839612 |
# Evaluating testing model performance
decisiontree_testing_perf_postprune = score(best_model,X_test,y_test,'Testing')
decisiontree_testing_perf_postprune
The precision score is 0.834
The recall score is 0.796
The accuracy score is 0.883
The F1 score is 0.815
Results for the Testing model are:
|   | Recall | Precision | Accuracy | F1_score |
|---|---|---|---|---|
| 0 | 0.796422 | 0.833829 | 0.882753 | 0.814696 |
plt.figure(figsize=(17,15))
tree.plot_tree(best_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Importance of features in the tree building. The importance of a feature is computed
# as the (normalized) total reduction of the criterion brought by that feature.
# It is also known as the Gini importance.
print (pd.DataFrame(best_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
| Feature | Imp |
|---|---|
| lead_time | 0.403941 |
| avg_price_per_room | 0.157766 |
| market_segment_type_Online | 0.142774 |
| no_of_special_requests | 0.100915 |
| arrival_month | 0.055791 |
| no_of_weekend_nights | 0.031373 |
| arrival_date | 0.029059 |
| no_of_adults | 0.026057 |
| no_of_week_nights | 0.017415 |
| arrival_year | 0.013908 |
| required_car_parking_space | 0.010165 |
| type_of_meal_plan_Meal Plan 1 | 0.003322 |
| room_type_reserved_Room_Type 4 | 0.001884 |
| market_segment_type_Offline | 0.001710 |
| room_type_reserved_Room_Type 1 | 0.001345 |
| room_type_reserved_Room_Type 5 | 0.001083 |
| room_type_reserved_Room_Type 2 | 0.000616 |
| no_of_children | 0.000478 |
| type_of_meal_plan_Not Selected | 0.000399 |
| type_of_meal_plan_Meal Plan 3 | 0.000000 |
| repeated_guest | 0.000000 |
| room_type_reserved_Room_Type 3 | 0.000000 |
| no_of_previous_bookings_not_canceled | 0.000000 |
| room_type_reserved_Room_Type 6 | 0.000000 |
| room_type_reserved_Room_Type 7 | 0.000000 |
| market_segment_type_Aviation | 0.000000 |
| market_segment_type_Complementary | 0.000000 |
| market_segment_type_Corporate | 0.000000 |
| no_of_previous_cancellations | 0.000000 |
| type_of_meal_plan_Meal Plan 2 | 0.000000 |
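Gini importances can favor high-cardinality or continuous features, so a common sanity check is permutation importance, which measures how much a metric degrades when a feature's values are shuffled. A sketch on synthetic data (in the notebook this would be run as `permutation_importance(best_model, X_test, y_test)`):

```python
# Permutation importance as a cross-check on impurity-based importances.
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=1)
model = DecisionTreeClassifier(ccp_alpha=0.01, random_state=1).fit(X, y)

# Shuffle each feature n_repeats times and record the mean drop in score.
result = permutation_importance(model, X, y, n_repeats=10, random_state=1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```

Features the two methods agree on (here, lead_time and avg_price_per_room dominating) can be trusted with more confidence.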
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# training performance comparison
models_train_comp_df = pd.concat(
[
decisiontree_train_perf.T,
decisiontree_train_perf_prune.T,
decisiontree_train_perf_postprune.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
|   | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Recall | 0.986608 | 0.785962 | 0.822193 |
| Precision | 0.995776 | 0.724058 | 0.857784 |
| Accuracy | 0.994211 | 0.830852 | 0.896542 |
| F1_score | 0.991171 | 0.753741 | 0.839612 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
decisiontree_test_perf.T,
decisiontree_test_perf_prune.T,
decisiontree_testing_perf_postprune.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
|   | Decision Tree sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) |
|---|---|---|---|
| Recall | 0.804373 | 0.783078 | 0.796422 |
| Precision | 0.797130 | 0.727513 | 0.833829 |
| Accuracy | 0.870440 | 0.834880 | 0.882753 |
| F1_score | 0.800735 | 0.754273 | 0.814696 |
Logistic Regression:
This model can be used to predict a guest's chance of canceling a reservation, with an F1 score of 69.22% at a threshold of 0.42, and precision and recall of roughly 69.43% and 69% respectively.
Decision Tree Model:
This model can be used to predict a guest's chance of canceling a reservation, with an F1 score of 81.47% using the post-pruning method, and precision and recall of roughly 83.38% and 79.64% respectively.
lead_time, avg_price_per_room, market_segment_type_Online, and no_of_special_requests are the most important features, with importance scores of 0.40, 0.16, 0.14, and 0.10 respectively.
The INN Hotels dataset has 36275 rows and 19 columns. There were no missing values or duplicates in the data.
1) About 72% of reservations have 2 adults, followed by 21.21% with 1 adult and then 3 adults. A few reservations have 0 or 4 adults.
2) About 92.56% of reservations have 0 children, followed by 4.4% with 1 child. The maximum number of children in a reservation is 10.
3) About 76.73% of reservations selected Meal Plan 1, 14.14% did not select any meal plan, 9.11% selected Meal Plan 2, and only 0.01% selected Meal Plan 3.
4) About 96.9% of reservations do not require a car parking space; only 3.1% do.
5) About 77.55% of reservations are for Room Type 1, followed by 16.7% for Room Type 4. The remaining five room types together add up to less than 6%.
6) More than 5000 guests did not book in advance. The average lead time is 85.23 days and the median is 57 days; some reservations have lead times of more than 400 days. The distribution is right-skewed.
7) About 82% of the data is from 2018 and only 17.95% from 2017. Most reservations fall in the 6th and 10th months of the year; very few are in the first quarter.
8) Only 2.57% of reservations are made by repeat guests; most come from new guests.
9) The average price per reservation is 103.42 dollars and the median is 99.45 dollars; the distribution is roughly normal.
10) About 54.5% of reservations have 0 special requests, followed by 31.35% with 1 request. Some reservations have 4 or 5 requests.
11) About 64% of reservations come through the online market segment, followed by 29% offline; the corporate segment accounts for about 5.56%.
12) About 67.24% of reservations are not canceled and 32.76% are canceled by the customer.
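The percentages in the summary above come from simple frequency counts; e.g. the cancellation split in point 12 is `value_counts(normalize=True)` on booking_status. A sketch on a toy frame that reuses the notebook's column name:

```python
# Frequency shares with value_counts(normalize=True), as used for the EDA summary.
import pandas as pd

df = pd.DataFrame({"booking_status": ["Not_Canceled"] * 67 + ["Canceled"] * 33})
shares = df["booking_status"].value_counts(normalize=True) * 100
print(shares.round(2))
```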
1) The highest F1 score, 81.47%, was achieved with the post-pruned decision tree, outperforming logistic regression; recall was 79.64% and precision 83.38%. Both the logistic regression models and the decision trees gave generalized performance on the training and test sets.
2) Bookings made well in advance have a lower average price per room, but cancellations are more frequent when the lead time is high. Reservations made more than 150 days in advance have about a 72% chance of cancellation, compared with only about 23% for those made less than 150 days in advance. INN Hotels should call customers who book far in advance to confirm that they still want the reservation, and should adopt stricter policies for last-minute cancellations, informing customers of them at booking time.
3) The cancellation rate is high for the online market segment (about 36%) and very low for corporate (about 10%); it is almost 30% for aviation, and the complementary segment has no cancellations. INN Hotels should pursue more corporate reservations by contacting nearby corporate offices, followed by the airline segment, to reduce revenue loss; tie-ups with corporate and airline partners would help.
4) The more special requests a customer makes, the lower the chance of cancellation. Customers without special requests have a cancellation rate of 43.2%; with just 1 request it drops to about 23.7%, and with 3, 4, or 5 requests it is 0%. The hotel should take proper care of customer requests and ask at booking time whether guests have any; this would improve INN Hotels' ratings and in turn grow its customer base.
5) Only 2.57% of reservations are made by repeat guests; most come from new guests. Repeat guests have a 98.28% chance of not canceling and only a 1.72% chance of canceling. INN Hotels should take good care of all guests and ask every customer for feedback at checkout to encourage return visits.
6) The chance of cancellation is about 24% when booked by a single guest, almost 35% for 2 or 3 guests, and almost 44% for 4 guests. Again, staff should follow up with larger parties about any special requests and offer some complimentary extras.
7) Reservations cost less early in the year, rise during the summer months (5 to 9), and then decline over the last two months. Rates for 2018 are noticeably higher than for 2017. INN Hotels should make sure it has enough staff during months 5 to 9 to provide good customer service; during the first quarter and the last two months of the year, fewer staff are needed because there are fewer reservations.
8) booking_status correlates most strongly with lead_time (0.44); it also has positive correlations of 0.14 with avg_price_per_room and 0.18 with arrival_year. avg_price_per_room correlates positively with no_of_adults (0.30) and no_of_children (0.34).
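The lead-time split in point 2 can be checked directly on the data with a `pd.cut`/`groupby`; sketched here on a small synthetic frame (the column names lead_time and booking_status follow the notebook's data dictionary, with booking_status already encoded as 1 = canceled):

```python
# Cancellation rate by lead-time bucket, as claimed in recommendation 2.
import pandas as pd

df = pd.DataFrame({
    "lead_time":      [10, 30, 60, 120, 160, 200, 250, 300],
    "booking_status": [ 0,  0,  1,   0,   1,   1,   0,   1],
})

# Bucket bookings at the 150-day threshold, then average the cancel flag.
bucket = pd.cut(df["lead_time"], bins=[0, 150, 450],
                labels=["<=150 days", ">150 days"])
rates = df.groupby(bucket, observed=True)["booking_status"].mean()
print(rates)
```

On the real dataset the same two lines would reproduce the ~23% vs ~72% split quoted above.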